Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems

Authors

  • Satinder P. Singh
  • Dimitri P. Bertsekas
Abstract

In cellular telephone systems, an important problem is to dynamically allocate the communication resource (channels) so as to maximize service in a stochastic caller environment. This problem is naturally formulated as a dynamic programming problem, and we use a reinforcement learning (RL) method to find dynamic channel allocation policies that are better than previous heuristic solutions. The policies obtained perform well for a broad variety of call traffic patterns. We present results on a large cellular system with approximately 70^49 states.

In cellular communication systems, an important problem is to allocate the communication resource (bandwidth) so as to maximize the service provided to a set of mobile callers whose demand for service changes stochastically. A given geographical area is divided into mutually disjoint cells, and each cell serves the calls that are within its boundaries (see Figure 1a). The total system bandwidth is divided into channels, with each channel centered around a frequency. Each channel can be used simultaneously at different cells, provided these cells are sufficiently separated spatially so that there is no interference between them. The minimum separation distance between simultaneous reuses of the same channel is called the channel reuse constraint.

When a call requests service in a given cell, either a free channel (one that does not violate the channel reuse constraint) may be assigned to the call, or else the call is blocked from the system; the latter happens if no free channel can be found. Also, when a mobile caller crosses from one cell to another, the call is "handed off" to the cell of entry; that is, a new free channel is provided to the call at the new cell. If no such channel is available, the call must be dropped/disconnected from the system.

One objective of a channel allocation policy is to allocate the available channels to calls so that the number of blocked calls is minimized. An additional objective is to minimize the number of calls that are dropped when they are handed off to a busy cell. These two objectives must be weighted appropriately to reflect their relative importance, since dropping existing calls is generally more undesirable than blocking new calls.

To illustrate the qualitative nature of the channel assignment decisions, suppose that there are only two channels and three cells arranged in a line. Assume a channel reuse constraint of 2, i.e., a channel may be used simultaneously in cells 1 and 3, but may not be used in cell 2 if it is already in use in cell 1 or in cell 3. Suppose that the system is serving one call in cell 1 and another call in cell 3. Then serving both calls on the same channel results in a better channel usage pattern than serving them on different channels, since in the former case the other channel is free to be used in cell 2. The purpose of the channel assignment and channel rearrangement strategy is, roughly speaking, to create such favorable usage patterns and thereby minimize the likelihood of calls being blocked.

We formulate the channel assignment problem as a dynamic programming problem which, however, is too complex to be solved exactly. We introduce approximations based on the methodology of reinforcement learning (RL) (e.g., Barto, Bradtke and Singh, 1995, or the recent textbook by Bertsekas and Tsitsiklis, 1996).
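To make the three-cell example above concrete, here is a minimal sketch (not from the paper) of the reuse-constraint check for cells arranged in a line; the dictionary representation, the linear distance metric, and all names are illustrative assumptions.

```python
# Sketch: a channel is free in `cell` if no other cell closer than the
# reuse distance is currently using it (cells indexed along a line).
def channel_is_free(channel, cell, usage, reuse_distance):
    return all(channel not in chans
               for other, chans in usage.items()
               if other != cell and abs(other - cell) < reuse_distance)

# Two channels, three cells in a line, reuse constraint of 2.
packed = {1: {0}, 2: set(), 3: {0}}   # both calls served on channel 0
spread = {1: {0}, 2: set(), 3: {1}}   # calls served on different channels

print(channel_is_free(1, 2, packed, reuse_distance=2))  # True: cell 2 can still be served
print(channel_is_free(0, 2, spread, reuse_distance=2))  # False
print(channel_is_free(1, 2, spread, reuse_distance=2))  # False: cell 2 is blocked
```

The packed pattern leaves channel 1 usable in cell 2, which is exactly the kind of favorable usage pattern the assignment and rearrangement strategy tries to create.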
Our method learns channel allocation policies that outperform not only the most commonly used policy in cellular systems, but also the best heuristic policy we could find in the literature.

1 CHANNEL ASSIGNMENT POLICIES

Many cellular systems are based on a fixed assignment (FA) channel allocation; that is, the set of channels is partitioned, and the partitions are permanently assigned to cells so that all cells can use all the channels assigned to them simultaneously without interference (see Figure 1a). When a call arrives in a cell, if any preassigned channel is unused, it is assigned; otherwise the call is blocked. No rearrangement is done when a call terminates. Such a policy is static and cannot take advantage of temporary stochastic variations in demand for service. More efficient are dynamic channel allocation policies, which assign channels to different cells on a need basis, so that every channel is available to every cell unless it is used in a nearby cell and the reuse constraint would be violated.

The best existing dynamic channel allocation policy we found in the literature is Borrowing with Directional Channel Locking (BDCL) of Zhang & Yum (1989). It numbers the channels from 1 to N, and partitions and assigns them to cells as in FA. The channels assigned to a cell are its nominal channels. If a nominal channel is available when a call arrives in a cell, the smallest-numbered such channel is assigned to the call. If no nominal channel is available, then the largest-numbered free channel is borrowed from the neighbour with the most free channels. When a channel is borrowed, careful accounting is done of the directional effect of which cells can no longer use that channel because of interference. The call is blocked if there are no free channels at all.

When a call terminates in a cell and the channel so freed is a nominal channel, say numbered i, of that cell, then if there is a call in that cell on a borrowed channel, the call on the smallest-numbered borrowed channel is reassigned to i and the borrowed channel is returned to the appropriate cell. If there is no call on a borrowed channel, then if there is a call on a nominal channel numbered larger than i, the call on the highest-numbered nominal channel is reassigned to i. If the call just terminated was itself on a borrowed channel, the call on the smallest-numbered borrowed channel is reassigned to it and that channel is returned to the cell from which it was borrowed. Notice that when a borrowed channel is returned to its original cell, a nominal channel becomes free in that cell and triggers a reassignment. Thus, in the worst case, a call termination in one cell can sequentially cause reassignments in arbitrarily far away cells, making BDCL somewhat impractical.

BDCL is quite sophisticated and combines the notions of channel ordering, nominal channels, and channel borrowing. Zhang and Yum (1989) show that BDCL is superior to its competitors, including FA. Generally, BDCL has continued to be highly regarded in the literature as a powerful heuristic (Enrico et al., 1996). In this paper, we compare the performance of dynamic channel allocation policies learned by RL with both FA and BDCL.
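For contrast, the FA baseline described above is simple enough to sketch in a few lines; the toy partition and the lowest-numbered tie-break below are illustrative assumptions, and BDCL's borrowing, directional locking, and reassignment logic is deliberately not reproduced.

```python
# Sketch of fixed assignment (FA): serve a call only on one of the cell's
# preassigned (nominal) channels; never borrow, never rearrange.
def fa_assign(cell, nominal, in_use):
    for ch in sorted(nominal[cell]):
        if ch not in in_use[cell]:
            in_use[cell].add(ch)
            return ch
    return None  # call is blocked

# Toy partition in which cells 0 and 2 are far enough apart to share channels.
nominal = {0: {0, 1}, 1: {2, 3}, 2: {0, 1}}
in_use = {c: set() for c in nominal}
print(fa_assign(0, nominal, in_use))  # -> 0
print(fa_assign(0, nominal, in_use))  # -> 1
print(fa_assign(0, nominal, in_use))  # -> None (blocked even though cell 1 is idle)
```

The third call is blocked even though cell 1's channels sit idle; exploiting exactly this inefficiency is what borrowing schemes such as BDCL, and the RL policy below, are designed to do.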
1.1 DYNAMIC PROGRAMMING FORMULATION

We can formulate the dynamic channel allocation problem using dynamic programming (e.g., Bertsekas, 1995). State transitions occur when channels become free due to call departures, when a call arrives at a given cell and wishes to be assigned a channel, or when there is a handoff, which can be viewed as a simultaneous call departure from one cell and a call arrival at another cell. The state at each time consists of two components: (1) the list of occupied and unoccupied channels at each cell, which we call the configuration of the cellular system; it is exponential in the number of cells; and (2) the event that causes the state transition (arrival, departure, or handoff); this component of the state is uncontrollable.

The decision/control applied at the time of a call departure is the rearrangement of the channels in use, with the aim of creating a more favorable channel packing pattern among the cells (one that will leave more channels free for future assignments). Unlike BDCL, our RL solution restricts this rearrangement to the cell with the current call departure. The control exercised at the time of a call arrival is the assignment of a free channel, or the blocking of the call if no free channel is currently available. In general, it may also be useful to do admission control, i.e., to allow the possibility of not accepting a new call even when a free channel exists, in order to minimize the dropping of ongoing calls during future handoffs. We address admission control in a separate paper and here restrict ourselves to always accepting a call if a free channel is available.

The objective is to learn a policy that assigns decisions (assignment or rearrangement, depending on the event) to each state so as to maximize

$J = E\left\{ \int_0^\infty e^{-\beta t} c(t)\, dt \right\}$,

where E{·} is the expectation operator, c(t) is the number of ongoing calls at time t, and β is a discount factor that makes immediate profit more valuable than future profit. Maximizing J is equivalent to minimizing the expected (discounted) number of blocked calls over an infinite horizon.

2 REINFORCEMENT LEARNING SOLUTION

RL methods solve optimal control (or dynamic programming) problems by learning good approximations to the optimal value function J*, given by the solution to the Bellman optimality equation, which takes the following form for the dynamic channel allocation problem:

$J^*(x) = E_e\left\{ \max_{a \in A(x,e)} E_{\Delta t}\left[ c(x, a, \Delta t) + \gamma(\Delta t)\, J^*(y) \right] \right\}$,   (1)

where x is a configuration, e is the random event (a call arrival or departure), A(x, e) is the set of actions available in the current state (x, e), Δt is the random time until the next event, c(x, a, Δt) is the effective immediate payoff including the discounting, and γ(Δt) is the effective discount for the next configuration y.

RL learns approximations to J* using Sutton's (1988) temporal difference (TD(0)) algorithm. A fixed feature extractor is used to form an approximate compact representation of the exponential configuration of the cellular array. This representation forms the input to a function approximator (see Figure 1) that learns/stores estimates of J*. No partitioning of channels is done; all channels are available in each cell. On each event, the estimates of J* are used both to make decisions and to update the estimates themselves, as follows:

Call Arrival: When a call arrives, evaluate the next configuration for each free channel and assign the channel that leads to the configuration with the largest estimated value. If there is no free channel at all, no decision has to be made.
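The arrival rule just described is a one-step greedy evaluation of the learned value function. Below is a minimal sketch reusing the linear-array representation from the earlier example; value_fn stands in for the paper's feature extractor plus function approximator, and the toy value function is an assumption, not the authors' features. The termination rule described next is analogous.

```python
def on_call_arrival(usage, cell, value_fn, n_channels, reuse_distance):
    """Greedy arrival rule: evaluate the configuration that would result
    from each free channel and assign the best; None means the call is blocked."""
    def is_free(ch):
        return ch not in usage[cell] and all(
            ch not in chans for other, chans in usage.items()
            if other != cell and abs(other - cell) < reuse_distance)

    best_ch, best_v = None, float("-inf")
    for ch in filter(is_free, range(n_channels)):
        nxt = {c: set(s) for c, s in usage.items()}  # deterministic next configuration
        nxt[cell].add(ch)
        v = value_fn(nxt)                            # estimate of J* at the next configuration
        if v > best_v:
            best_ch, best_v = ch, v
    return best_ch

# Toy stand-in for the learned approximator: prefer tightly packed
# configurations, i.e., fewer distinct channels in use system-wide.
toy_value = lambda u: -len(set().union(*u.values()))
usage = {1: {0}, 2: set(), 3: set()}
print(on_call_arrival(usage, 3, toy_value, n_channels=2, reuse_distance=2))  # -> 0
```

Under this toy value function, the new call in cell 3 is packed onto channel 0 alongside the call in cell 1, reproducing the favorable pattern from the introduction.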
Call Termination: When a call terminates, each ongoing call in that cell is considered in turn for reassignment to the just-freed channel; the resulting configurations are evaluated and compared to the value of not doing any reassignment at all. The action that leads to the highest-valued configuration is then executed.

On call arrival, as long as there is a free channel, the number of ongoing calls and the time to the next event do not depend on which free channel is assigned. Similarly, the number of ongoing calls and the time to the next event do not depend on the rearrangement done on call departure. Therefore, both the sample immediate payoff, which depends on the number of ongoing calls and the time to the next event, and the effective discount factor, which depends only on the time to the next event, are independent of the choice of action. Thus one can choose the current best action by simply considering the estimated values of the next configurations. The next configuration for each action is deterministic and trivial to compute.

When the next random event occurs, the sample payoff and the discount factor become available and are used to update the value function as follows: on a transition from configuration x to y under action a in time Δt,

$J_{new}(\tilde{x}) = (1 - \alpha)\, J_{old}(\tilde{x}) + \alpha \left( c(x, a, \Delta t) + \gamma(\Delta t)\, J_{old}(\tilde{y}) \right)$,   (2)

where x̃ indicates the approximate feature-based representation of x. The parameters of the function approximator are then updated to best represent J_new(x̃) using gradient descent on the squared error $(J_{new}(\tilde{x}) - J_{old}(\tilde{x}))^2$.
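Below is a minimal sketch of the update in Eq. (2) for a linear function approximator. The feature vectors, learning rate, and the closed-form expressions for the effective payoff and discount are assumptions consistent with the objective J defined above (between events the number of ongoing calls is constant, so the discounted integral has a closed form); this is not the authors' exact implementation.

```python
import numpy as np

def td0_update(w, feats_x, feats_y, payoff, discount, alpha=0.01):
    """One step of Eq. (2) with a linear approximator J(x~) = w . feats_x:
    gradient descent on (target - J_old(x~))^2 moves J(x~) toward the target."""
    target = payoff + discount * (w @ feats_y)  # c(x, a, dt) + gamma(dt) * J_old(y~)
    w += alpha * (target - w @ feats_x) * feats_x
    return w

# Effective payoff and discount for inter-event time dt, assuming the number
# of ongoing calls n is constant between events:
#   c(x, a, dt) = integral_0^dt n e^(-beta t) dt = n (1 - e^(-beta dt)) / beta
#   gamma(dt)   = e^(-beta dt)
beta, dt, n_calls = 0.1, 0.5, 12
payoff = n_calls * (1.0 - np.exp(-beta * dt)) / beta
discount = np.exp(-beta * dt)

w = np.zeros(4)
feats_x = np.array([1.0, 0.3, 0.5, 0.2])  # hypothetical features of x~
feats_y = np.array([1.0, 0.2, 0.6, 0.1])  # hypothetical features of y~
w = td0_update(w, feats_x, feats_y, payoff, discount)
```

Because the payoff and discount are action-independent (as argued above), this update can be applied once per event regardless of which greedy action was chosen.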
3 SIMULATION RESULTS

Call arrivals are modeled as Poisson processes with a separate mean for each cell, and call durations are modeled with an exponential distribution. The first set of results is on the 7 by 7 cellular array of Figure 1a with 70 channels (roughly 70^49 configurations) and a channel reuse constraint of 3 (this problem is borrowed from Zhang and Yum's (1989) paper on an empirical comparison of BDCL and its competitors). Figures 2a, b, and c are for uniform call arrival rates of 150, 200, and 350 calls/hr, respectively, in each cell. The mean call duration for all the experiments reported here is 3 minutes. Figure 2d is for non-uniform call arrival rates. Each curve plots the cumulative empirical blocking probability as a function of simulated time; each data point is therefore the percentage of system-wide calls that were blocked up until that point in time. All simulations start with no ongoing calls.

Publication date: 1996